Assigment-1

setwd("~/Downloads/M A S T E R S /Sem-3/Part-1/Visualization/labs/lab-4")
data = read.delim("prices-and-earnings.txt",header = TRUE)

Q.1)

col_extract <- c(1,2,5,6,7,9,10,16,17,18,19) 

data <- data[col_extract]

rownames(data) <- data[,1]

data<- data[,-1]

Q.2)

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data_scaled=scale(data)

plot_ly(x=colnames(data_scaled), y=rownames(data_scaled), 
        z=data_scaled, type="heatmap", colors =colorRamp(c("yellow", "red")))

Based on the above heatmap, we dont see any distinct clusters i.e area with similar color intensity. We can only though see outliers that are colored in dark red and bright yellow that are spread across the heatmap.

Q.3)

#a)

library(seriation)
## Warning: package 'seriation' was built under R version 4.3.3
rowdist <- dist(data_scaled)
coldist <- dist(t(data_scaled))

order1 <- seriate(rowdist, method = "HC_complete")
order2 <- seriate(coldist, method = "HC_complete")
ord1 <- get_order(order1)
ord2 <- get_order(order2)

reordmatr<-data_scaled[rev(ord1),ord2]


plot_ly(x=colnames(reordmatr), y=rownames(reordmatr), 
        z=reordmatr, type="heatmap", colors =colorRamp(c("yellow", "red")))

#b)

rowdist_2 <- as.dist(1-cor(t(data_scaled))) #tranpose here cause cor() finds the correlation between columns
coldist_2 <- as.dist(1-cor(data_scaled))


order1_cor <- seriate(rowdist_2, method = "HC_complete")
order2_cor <- seriate(coldist_2, method = "HC_complete")
ord1_cor <- get_order(order1_cor)
ord2_cor <- get_order(order2_cor)

reordmatr_2<-data_scaled[rev(ord1_cor),ord2_cor]


plot_ly(x=colnames(reordmatr_2), y=rownames(reordmatr_2), 
        z=reordmatr_2, type="heatmap", colors =colorRamp(c("yellow", "red")))

Based on the 2 heatmaps, plot a) is easier to analyse since the intensity of similar colors is less spread out compared to plot b) and most are clustered around the same area. Additionally, we can see a more distinct partition of the heatmap into 4 quadrants making it easier to analyse each of them separately and also understand how the color scheme switches from one quadrant to another.

Analysis of plot a):

We notice 2 clusters of cities mentioned on the top and bottom. If we look further into these cities it suggests that cities on the top are developed cities (Zurich, Luxembourg etc.) and the cities in the bottom are developing/underdeveloped (Nairobi, Cairo etc.).

Let us name developed cities as cluster-1 and the underdeveloped cities as cluster-2 based on the heatmap.

For cluster-1, we see that for variables like “Big.Mac.Min” , “iphone.4s.hr” , “Rice.kg.in.min” it takes lesser time compared to the average. While for other variables like “Goods and Services”, “Wage.Net,”Food.Costs” are higher compared to the average. And the basically the opposite can be said for cluster-2. This insights also reaffirms to the point that cluster-1 comprises of developed cities and cluster-2 comprises of developing/underdeveloped cities. And there is a direct correlation within these 2 clusters for different attributes mentioned above i.e they follow the same patterns for those attributes.

Q.4)

order1_tsp <- seriate(rowdist, method = "TSP")
order2_tsp <- seriate(coldist, method = "TSP")
ord1_tsp <- get_order(order1_tsp)
ord2_tsp <- get_order(order2_tsp)

reordmatr_3<-data_scaled[rev(ord1_tsp),ord2_tsp]

plot_ly(x=colnames(reordmatr_3), y=rownames(reordmatr_3), 
        z=reordmatr_3, type="heatmap", colors =colorRamp(c("yellow", "red")))
print("HC solver using Euclidean distance:")
## [1] "HC solver using Euclidean distance:"
cat("\n")
rbind(
    unordered = criterion(rowdist),
    ordered = criterion(rowdist, order1)
)
##                2SUM AR_deviations AR_events      BAR Gradient_raw
## unordered 1012004.5     107139.67     61656 29259.38        -4032
## ordered    877875.7      60436.53     44150 20869.53        30980
##           Gradient_weighted  Inertia Lazy_path_length Least_squares      LS
## unordered         -11081.08 17886336        10126.431       3575435 1006127
## ordered            70463.10 20463445         4628.197       3466709  951764
##           MDS_stress       ME Moore_stress Neumann_stress Path_length      RGAR
## unordered  0.7175256 568.2673     986.9925       553.9889    281.7269 0.5169014
## ordered    0.5951122 642.9323     466.0505       267.6109    138.9674 0.3701375
##                  Rho
## unordered 0.04186156
## ordered   0.37948350
cat("\n")
print("HC solver using one minus correlation:")
## [1] "HC solver using one minus correlation:"
cat("\n")
rbind(
    unordered = criterion(rowdist),
    ordered = criterion(rowdist, order1_cor)
)
##                2SUM AR_deviations AR_events      BAR Gradient_raw
## unordered 1012004.5     107139.67     61656 29259.38        -4032
## ordered    821378.1      42677.84     37697 21529.61        43886
##           Gradient_weighted  Inertia Lazy_path_length Least_squares        LS
## unordered         -11081.08 17886336        10126.431       3575435 1006126.8
## ordered           108517.86 22474998         5322.491       3415970  926394.1
##           MDS_stress       ME Moore_stress Neumann_stress Path_length      RGAR
## unordered  0.7175256 568.2673     986.9925       553.9889    281.7269 0.5169014
## ordered    0.5283662 618.7282     630.2581       346.4586    168.6820 0.3160379
##                  Rho
## unordered 0.04186156
## ordered   0.46695572
cat("\n")
print("TSP:")
## [1] "TSP:"
cat("\n")
rbind(
    unordered = criterion(rowdist),
    ordered = criterion(rowdist, order1_tsp)
)
##                2SUM AR_deviations AR_events      BAR Gradient_raw
## unordered 1012004.5     107139.67     61656 29259.38        -4032
## ordered    897657.4      61723.42     47114 20235.28        25052
##           Gradient_weighted  Inertia Lazy_path_length Least_squares        LS
## unordered         -11081.08 17886336        10126.431       3575435 1006126.8
## ordered            67268.48 20117340         4388.597       3470969  953893.7
##           MDS_stress       ME Moore_stress Neumann_stress Path_length      RGAR
## unordered  0.7175256 568.2673     986.9925       553.9889    281.7269 0.5169014
## ordered    0.6003779 651.4801     422.0436       242.2188    121.5289 0.3949866
##                  Rho
## unordered 0.04186156
## ordered   0.35281176

Based on the 2 plots, the plot produced by the HC solver using Euclidean distance is relatively better to analyse. However the TSP plot is better than than the plot produced by the HC solver using one minus correlation.

As we can see the value of the Path length for TSP is lower than that of the others which suggests that it is more efficient. The path length for TSP is 123.5 while for HC solver using Euclidean distance is 138.9 so there is not a drastic difference and that can also be seen in the plots.

And since we want to a lower gradient value, the lowest one is for TSP too.

Q.5)

library(plotly)


#Plotly - can not see observation ID
data=data


p <- data %>%
  plot_ly(type = 'parcoords', 
          line = list(color = ~ifelse(iPhone.4S.hr. <= 50, 1, 2),  # Use numeric values for color assignment
                colorscale = list(c(1, 2), c('red', 'blue'))),
          
          dimensions = list(
            list(label = 'Food Costs', values = ~Food.Costs...),
            list(label = 'iPhone 4S.hr', values = ~iPhone.4S.hr.),
            list(label = 'Clothing Index', values = ~Clothing.Index),
            list(label = 'Hours Worked', values = ~Hours.Worked),
            list(label = 'Wage Net', values = ~Wage.Net),
            list(label = 'Vacation Days', values = ~Vacation.Days),
            list(label = 'Big Mac min', values = ~Big.Mac.min.),
            list(label = 'Bread kg in min', values = ~Bread.kg.in.min.),
            list(label = 'Rice kg in min', values = ~Rice.kg.in.min.),
            list(label = 'Goods and Services', values = ~Goods.and.Services...)
          )
  )

p

For the dark color lines, the variables that are important to define this cluster are: “iPhone 4S.hr”, “Big Mac min”, “Bread kg in min”, “Rice kg in min” and “Goods and Services”

Values: iPhone 4S.hr -> Less than or equal to 50 Big Mac min -> Less than or equal to 20 Bread kg in min -> Less than or equal to 20 Rice kg in min -> Less than or equal to 15 Goods and Services -> greater than or equal to 3000

Based on the above values, this cluster resembles that of a developed country.

For the yellow color lines, the variables that are important to define this cluster are: “Wage Net”, “Clothing Index”, “Food Costs” and “Goods and Services”

Values: Wage Net -> Less than or equal to 35 Clothing Index -> Less than or equal to 70 Food Costs -> Less than or equal to 400 Goods and Services -> Less than or equal to 2500

Based on the above values, this cluster resembles that of a underdeveloped/developing country.

One prominent outlier is the only one that had the food costs value more than 900, while the rest are all below 700. This particular outlier also has very high values (the highest in some) for attributes like “Clothing Index” and “Goods and Services”. While having very low values for attributes like “iPhone 4S.hr” and “Big Mac min”. This suggests that this particular outlier is a highly developed country.

Q.6)

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the plot, the 2 cluster:

  1. Top left, which all look to have similar shapes.

  2. Bottom left, we can see similar shapes here too.

A distinct outlier is the Tokyo which covers a huge area around “Costs”, “Wage Net”, “Clothing Index”, “Goods and Services” etc. While it covers little to no area in the rest of the attributes.

Q.7)

Based on the 3, the heatmap was the best to analyze the data and easily get meaningful insights. The parallel coordinates plot reaffirmed some insights that I was able to easily get from the heatmap. The parallel coordinates plot also doesnt proivide row value names i.e city so futher insights could not be drawn based on this information.

The heatmap allowed for a more efficiently way of looking at the plot based on the color gradient making it easier to see how the color changes across the plot which helped see the direction of the color gradients especially after reordering the data. This helped to identify attributes and the corresponding rows that followed a trend.

The radar plot was definitely hardest to analyze among the 3 due to the number of shapes and each being unique so it had to be looked it much detial to see patterns in the shapes.